Abstract

We present a novel framework, called Private Disclosure of
Information (PDI), which aims to prevent an adversary from
inferring certain sensitive information about subjects using
the data that they disclosed during communication with an
intended recipient. We show cases where it is possible to
achieve perfect privacy regardless of the adversary’s auxiliary
knowledge while preserving the full utility of the information
to the intended recipient, and we provide sufficient conditions
for such cases. We also demonstrate the applicability of PDI
on a real-world data set that simulates a health tele-monitoring
scenario.


1 Introduction 


Data collection and sharing are growing to unprecedented
volumes. Some of the reasons for this phenomenon include the
decrease in storage cost, the rise of social networks, the
ubiquity of smartphones and legal regulations. For example, in
many states in the US, medical institutions are obliged to make
[...] (Sweeney 2002; OSHPD 2014). Warner (1965) argues that the
lack of privacy guarantees
can cause subjects to be reluctant to share their data with data 
collectors (such as doctors, government agencies, researchers, 
etc.) or even result in subjects providing false information. 
Therefore, subjects need to be assured that their privacy will 
be preserved throughout the whole process of data collection 
and use. 

One of the emerging areas with growing interest in collecting
sensitive personal and private data is health tele-monitoring. In
this setting, a technology is used to collect health-related data
about patients, which are later submitted to medical staff for
monitoring. The data are then used to assess the health status
of patients and provide them with feedback and/or intervention.
Research indicates that such technologies can improve
readmission rates and lower overall costs (Clark et al. 2007;
Chaudhry et al. 2010; Inglis 2010; Giamouzis et al. 2012;
Aranki et al. 2014). In such scenarios, the collected data are
usually sensitive from a privacy point of view, and therefore
privacy-preserving technologies are needed in order to protect
patients’ privacy and increase compliance.

There are multiple stages in the life-cycle of data, including
i) the disclosure (or submission) of the data by the subjects to
the data collector; ii) the processing of the data; iii) the
analysis; and/or iv) the publishing of (often a privatized version
of) the data or some findings based on them. In this paper we
focus on the phase of disclosure of privacy-sensitive data by
the data owners. Our framework for Private Disclosure of
Information (PDI) is thus aimed to prevent an adversary from
inferring certain sensitive information about the subject using
the data that were disclosed during communication with an
intended recipient. This is analogous to the problem of attribute
linkage in statistical database privacy.

In traditional encryption approaches to maintaining privacy,
it is often implicitly assumed that the data themselves are the
private information. However, in more general scenarios, the
data can be used to infer some private information about the
subjects to whom the data apply. For example, respiration rate
by itself might not be considered private information. However,
if the data from the collected respiration rate are used to
infer whether the individual is a smoker or not, they become
sensitive information. One can argue that because the information
about whether someone smokes is private, the respiration
rate data become private by implication.


Under such circumstances, one should attempt to privatize
the transmitted data in a way that reveals as little as possible
about the private information to an adversary. In summary, our
objective is to encode the transmitted data in order to hide
another private piece of information. In the words of Sweeney
(2002): “Computer security is not privacy protection.” The
converse is also true: privacy does not replace security. Our
approach is therefore to be viewed as complementary to
classical security approaches. For example, data can be
privatized and then encrypted.

The rest of this paper is organized as follows. In Section 2
we provide a survey of the literature for related work. In
Section 3 we provide the motivation for the problem and formulate
it, followed by further analysis in Section 4. We then discuss
implementation details of the learning problem in Section 5,
followed by experimental results in Section 6. Finally, we
close by discussing our conclusions and future research
directions in Section 7.


2 Related Work 


The study of privacy-preserving techniques and technologies
in the fields of statistics, computer security and databases, and
their intersections, dates back to at least 1965, when Warner
proposed a randomization technique for conducting surveys
and collecting responses for the purpose of statistical and
population analysis. Since then, extensive privacy research has
been conducted in these fields. Therefore, in the interest of
brevity, we provide a brief overview of the areas of study
related to our work and refer the reader to more comprehensive
surveys in each area.

Recently, attention to privacy has been rising in the healthcare
domain with the spread of electronic health-records usage
and the growing data sharing between medical institutions.
It has been reported that consumers are expressing increasing
concerns regarding their health privacy (Bishop et al. 2005;
Hsiao and Hing 2012). Most of the research in privacy from
the health community focuses on medical data publishing and
is therefore database-centric. For a survey of results in this
domain, we refer the reader to (Gkoulalas-Divanis et al. 2014).

In more general-purpose scenarios, the privacy of statistical
databases and data publishing has been extensively studied.
Denning and Schlorer (1983) presented some of the
early threats related to inference in statistical databases and
reviewed controls that are based on the lattice model
(Denning 1976). Duncan and Lambert (1989; 1986) studied methods
for limiting disclosure and linkage risks in data publishing.
Sandhu (1993) provided a tutorial on lattice-based access
controls for information flow security and privacy. Later, Farkas
and Jajodia (2002) provided a survey of more results in the
field of access controls to the inference problem in database
security. For rigorous surveys in the fields of data publishing
privacy and statistical databases privacy, we refer the reader
to (Adam and Worthmann 1989; Fung et al. 2010).

Two semantic models of database privacy of growing interest
in the privacy literature are k-anonymity (Sweeney 2002)
and differential privacy (Dwork 2006; 2008). In k-anonymity,
given a set of quasi-identifiers that can be used to re-identify
subjects, a table is called k-anonymous if every combination
of quasi-identifiers in the table appears in at least k records.
If a table is k-anonymous, assuming each individual has a
single record in the table, then the probability of linking a
record to an individual is at most 1/k. Other extensions and
refinements of k-anonymity have been proposed, including
l-diversity (Machanavajjhala et al. 2007), t-closeness (Li et al.
2007) and others.
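
The k-anonymity condition can be checked mechanically by counting how often each quasi-identifier combination occurs. The following is a minimal Python sketch for illustration only; the function name is_k_anonymous, the column names and the toy table are hypothetical and do not come from the cited works.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every quasi-identifier combination occurs in >= k records."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(n >= k for n in counts.values())


# Hypothetical, already generalized toy table (illustration only).
table = [
    {"zip": "947**", "age": "20-29", "diagnosis": "flu"},
    {"zip": "947**", "age": "20-29", "diagnosis": "asthma"},
    {"zip": "946**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "946**", "age": "30-39", "diagnosis": "diabetes"},
]

print(is_k_anonymous(table, ["zip", "age"], k=2))
print(is_k_anonymous(table, ["zip", "age"], k=3))
```

In this toy table each (zip, age) combination appears exactly twice, so the table satisfies the condition for k = 2 but not for k = 3.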

In differential privacy, the requirement is that the output
of a statistical query should not be too sensitive to any single
record in the database. Formally, a statistical query $M$ is
$\epsilon$-differentially private if
$P(M(D_1) \in S) \le e^{\epsilon} \times P(M(D_2) \in S)$
for any two realizations $D_1$ and $D_2$ of the database such
that $|D_1 \triangle D_2| = 1$ and all $S \subseteq \mathrm{Range}(M)$,
where $D_1 \triangle D_2$ is the symmetric difference between
$D_1$ and $D_2$ (Dwork 2006; 2008). Cormode (2011) showed that
sensitive attribute inference can be done on databases that are
differentially private and l-diverse with similar accuracy.

As can be seen from the review above, most of the research
in data privacy is focused on privacy-preserving data publishing
and privacy-preserving statistical databases. In contrast,
in this work we focus on preventing adversarial statistical
inference of a piece of private information based on the
messages disclosed during an individual’s information exchange
with an intended recipient.

3 Problem Formulation 

3.1 Notation 

We use the following shorthand notation for probability density
(mass) functions. We always use a pair of a capital and a
small symbol of the same letter for a random variable and a
realization of it, respectively. For simplicity and conciseness
of notation, given random variables X and Y, instead of writing
$p_X(x)$ for the marginal density (mass) function of X we simply
write $p(x)$, and instead of writing $p_{X|Y}(x|y)$ for the
conditional density (mass) function of X given Y, we simply write
$p(x|y)$.

3.2 Motivation and Threat Model 

We are primarily motivated by the tele-monitoring setting. In
this setting, a doctor wishes to monitor her patients remotely
using a technology that can collect and transmit health-related
data. The shared data are of a sensitive nature because they can
be used to infer private pieces of information such as a health
condition or a disease. For example, updates about a patient’s
weight can lead to disclosure of obesity, as will be demonstrated
in Section 6.

More generally, an information provider Bob wants to disclose
a piece of information x to some recipient Alice. Furthermore,
the information x can be used to infer some private
information c about Bob. However, there is no guarantee that
the transmitted information will not be intercepted and
potentially used for inference of the private information c about
Bob by an untrusted but passive eavesdropper Eve. Finally, in this
setting, we assume that Alice is more certain about c than Eve
is. The problem at hand is delivering the information x under
these circumstances such that Alice can make full use of the
information but Eve’s ability to infer c about Bob, using
the transmitted message, is minimized.

Figure 1: The Graphical Model of PDI

As a concrete example, consider the following scenario in
health tele-monitoring. A patient Bob is trying to update his
physician Alice about his weight and body mass index (BMI).^1
Since Alice is Bob’s physician, she already knows the weight
status category of Bob, which he considers to be private
information.^2 Eve, however, does not know Bob’s weight status
category a priori but would like to learn it from the messages
he sends to Alice. If Eve succeeds in listening in on the
communication between Bob and Alice, Eve can, with some accuracy,
infer the weight status category of Bob. Alice, being a
considerate physician, wants to ensure the privacy of her patients.
Alice decides to create an encoding scheme (that can be made
public) for the communication such that the encoding is different
per weight status group. Her objective is to make this encoding
scheme “as privacy-preserving as possible” in the sense
of keeping her patients’ weight status category information as
private as possible to someone who does not know it a priori.

It is important to compare this scenario with the classical
security approach. In classical security, the objective is to
protect the transmitted message itself without taking into
consideration an adversarial effort to statistically infer private
information using the cipher-text. It has been demonstrated that
statistical inference can still be performed on encrypted data
(for example, White et al. 2011; Miller et al. 2014). We
complement this by capturing the notion of statistical inference
of the private information c from the transmitted data, and aim
to find a way to minimize the ability of an adversary to infer c
using the transmitted data.


3.3 Problem Definition 

Towards a more formal representation of the problem, we consider
scenarios where i) Bob’s identity, s, is attached to any
message that is sent by him; ii) there is no guarantee that the
sent information will not be intercepted by an untrusted but
passive eavesdropper Eve; iii) the information x can be used
to infer some private information c about Bob; and iv) Alice
knows the private information c about Bob but Eve does not.

^1 BMI is a measure of relative weight based on an individual’s
mass and height, defined as BMI = mass / height^2 (in kg/m^2).

^2 Weight status category indicates if an individual is
underweight, overweight, obese or has a healthy weight.


Under these assumptions, Bob would like to exploit the fact
that Alice knows c but Eve does not in order to send a message
z that is more useful to Alice than to Eve. The utility value
of the message follows the following decoding and “hiding
class” (HC) premises:

DECODING Alice can make full use of the sent information
z, i.e. obtain the original message x from the transmitted
message z; and

HC Eve’s ability to make inference about c given s, based on
the sent information z, is minimized.

Formally, we use $\mathcal{S}$ for the set of identifiers of
information providers, $\mathcal{X}$ for the information space and
$\Sigma$ for the set of private classes (the private information
about the information providers). Similarly, we define the random
variables S for the identifier of the information provider, X for
the piece of information that the provider would like to disclose,
C for the class that the provider belongs to and Z for the encoded
message that will be sent (called privatized information), which
is a function of the original information and the class. We
call this function a privacy mapping function and define it as
$R : \Sigma \to \mathcal{X}_{\hookrightarrow}$, where
$\mathcal{X}_{\hookrightarrow}$ is the set of injective functions
$\mathcal{X} \to \mathcal{X}$. A simple way to think about R is as
an encoding scheme. That is, for every class $c \in \Sigma$, it
outputs an encoding function for the input information x. Given
$c \in \Sigma$, since $R(c)$ is injective, there exists a left
inverse $R^{\dagger}(c)$ which will be used to decode the messages
z sent from subjects in class c.^3 From that, Z is simply equal to
$[R(C)](X)$. The statistical model that relates these random
variables is described in Figure 1.

For conciseness, in this paper we treat the case of continuous
information spaces. In the case of a discrete information space,
the reader can follow the discussion by substituting probability
density functions with probability mass functions for the
distributions of X and Z. Note that our treatment also covers the
case of information spaces of mixed nature (that are discrete in
some attributes and continuous in others) by using the appropriate
probability distribution functions.

For the model in Figure 1 one needs to supply the following
probability distributions: $p(s)$, the prior of subjects
transmitting messages in the system; $p(c|s)$, the adversary’s
prior of class membership for the different subjects (based on
auxiliary knowledge); and $p(x|c,s)$, the generative model of data
given a class and a subject. Finally, $p(z|x,c)$ is simple and can
be modeled as $P(Z = z | X = x, C = c) = 1$ if and only if
$z = [R(c)](x)$ and 0 otherwise, for all $z, x \in \mathcal{X}$
and $c \in \Sigma$.

Recall that the identity s of the information provider is
attached to the transmitted message. Moreover, the intended
recipient knows the class c of the information provider.
Therefore, because of the injectivity requirement of the privacy
mapping function, the intended recipient can decode the sent
information z back to the original message x. Hence the
requirement (DECODING) is satisfied.

^3 We say that $g : D_2 \to D_1$ is a left inverse of a function
$f : D_1 \to D_2$ if for all $x \in D_1$ we have $g(f(x)) = x$.

Finally, in order to satisfy the second requirement (HC) we
would like to find a privacy mapping function R that minimizes
the amount of information that the privatized information Z
carries, to an adversary, for the sake of inferring the private
class C given the subject identifier S. We adopt
the measure of (conditional) mutual information to model this
quantity. We present the definition of conditional mutual
information for continuous random variables, and refer the reader
to (Cover and Thomas 2006, Definitions 2.61 and 8.54) for
the corresponding definitions concerning discrete random
variables and random variables that can be mixtures of discrete
and continuous, respectively.


Definition 1 (Cover and Thomas 2006, c.f. Definition 8.49).
Let X, Y and Z be random variables. The conditional mutual
information of X and Y given Z, $I(X,Y|Z)$, is defined as

$$I(X,Y|Z) \triangleq E_{p(x,y,z)}\left[\log \frac{p(x,y|z)}{p(x|z)\,p(y|z)}\right]$$


Intuitively, $I(Z,C|S;R)$ measures, in bits, the expected
amount of mutual information that the random variables
$Z = [R(C)](X)$ and C have, given the information in S.^4 Mutual
information also provides a sufficient and necessary condition
for conditional independence as follows.


Lemma 1 (Cover and Thomas 2006, c.f. Corollary 2.92; c.f.
Theorem 8.6.1). $I(Z,C|S;R) \ge 0$ for any privacy mapping
function R. Furthermore, $I(Z,C|S;R) = 0$ if and only if Z
and C are conditionally independent given S using the privacy
mapping function R.
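
For a discrete information space the quantity in Lemma 1 can be computed directly from a joint probability table. The following is a minimal Python sketch of that discrete analogue, given for illustration only; the function name and the two toy joint distributions are assumptions of the sketch, and base-2 logarithms are used so that the result is in bits.

```python
import math


def conditional_mutual_information(joint):
    """I(Z, C | S) in bits for a discrete joint pmf given as {(z, c, s): prob}."""
    p_s, p_zs, p_cs = {}, {}, {}
    for (z, c, s), p in joint.items():
        p_s[s] = p_s.get(s, 0.0) + p
        p_zs[(z, s)] = p_zs.get((z, s), 0.0) + p
        p_cs[(c, s)] = p_cs.get((c, s), 0.0) + p
    total = 0.0
    for (z, c, s), p in joint.items():
        if p > 0.0:
            # p(z,c|s) / (p(z|s) p(c|s)) = p(z,c,s) p(s) / (p(z,s) p(c,s))
            ratio = p * p_s[s] / (p_zs[(z, s)] * p_cs[(c, s)])
            total += p * math.log2(ratio)
    return total


# Hypothetical joints over (z, c, s) with a single subject s = 0.
independent = {(z, c, 0): 0.25 for z in (0, 1) for c in (0, 1)}
dependent = {(0, 0, 0): 0.5, (1, 1, 0): 0.5}

print(conditional_mutual_information(independent))
print(conditional_mutual_information(dependent))
```

The first table has Z and C conditionally independent given S, so the value is zero; in the second table Z determines C, so one full bit is revealed.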


From the intuition above, and the fact in Lemma 1, we set
our objective to find a privacy mapping function R that minimizes
the conditional mutual information of the privatized information
Z and the private class C given the identity of the
information provider S such that the model in Figure 1 holds.
In short,

$$R^* = \arg\min_{R} \; I(Z,C|S;R) \qquad (1)$$

subject to: R is a privacy mapping function
and the model in Figure 1 holds.

^4 The units are bits assuming the log base in Definition 1 is 2.

Once a privacy mapping function R is chosen, the communication
process can be carried out as follows.

Sending The transaction of disclosing a piece of information
$x \in \mathcal{X}$ by an information provider belonging to class
$c \in \Sigma$ is performed by applying the transformation
$z \leftarrow [R(c)](x)$ and sending z (or some encrypted version
of it).

Receiving The transaction of receiving a piece of information
$z \in \mathcal{X}$ sent by an information provider belonging to
class $c \in \Sigma$ is performed by applying
$x \leftarrow [R^{\dagger}(c)](z)$, where $R^{\dagger}(c)$ is a
left inverse of $R(c)$.
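
The sending and receiving steps can be sketched as a pair of functions indexed by the class. The following minimal Python sketch assumes a scalar information space and a hypothetical affine encoding per class; the class names and the parameter values in PARAMS are made up for illustration and are not learned from any data.

```python
# Hypothetical per-class parameters (scale a_c != 0, offset b_c), illustration only.
PARAMS = {"class_a": (2.0, 1.0), "class_b": (0.5, -3.0)}


def encode(x, c):
    """Sending: z <- [R(c)](x), here with [R(c)](x) = a_c * (x - b_c)."""
    a, b = PARAMS[c]
    return a * (x - b)


def decode(z, c):
    """Receiving: x <- left inverse of R(c) applied to z."""
    a, b = PARAMS[c]
    return z / a + b


x = 42.0
for c in PARAMS:
    z = encode(x, c)
    print(c, z, decode(z, c))
```

Because each per-class map is injective, a recipient who knows c recovers x exactly (up to floating-point rounding), which is the DECODING premise; an eavesdropper who sees only z and does not know c cannot tell which of the maps was applied.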

Note that the problem in Equation (1) is not a convex problem.
Furthermore, it is of interest to study how to learn the
model in Figure 1 and find an optimal privacy mapping function
R from data. We will address this question in Section 5,
but first we further study the properties of the formulated
framework in the following section.


4 Further Analysis 

First, we relate the value of the objective function in
Equation (1) to Bayesian inference in the following lemma.

Lemma 2. If a privacy mapping function R yields
$I(Z,C|S;R) = 0$ then Bayesian inference of C based on Z is
prevented for the adversary.

Proof. From Lemma 1 we know that Z is conditionally independent
of C given S, which means $p(c|z,s) = p(c|s)$, which
is the prior of the class membership that the adversary already
possesses. Therefore, the disclosure of Z does not change the
adversary’s belief regarding the private information C given
the subject identifier S. □

The next question that we need to ask is whether a privacy
mapping function R satisfying $I(Z,C|S;R) = 0$ is ever
attainable. There are three reasons for this question. First, if
such a privacy mapping function R exists, then it means that
by knowing S (which is always attached to the message), Z
provides no extra information for inferring C to an adversary,
which sounds surprising. Second, there is generally a trade-off
between information utility and privacy, where optimal privacy
is usually only attained at the cost of no utility (Dwork 2006).
In our case, the utility of the information Z to the intended
recipient is always fully preserved, regardless of the choice of
R, since $R(c)$ is injective for all $c \in \Sigma$. From this it
would seem that the scenario of perfect privacy is unattainable.^5
Finally, if such a privacy mapping function R exists,
it would assure optimality of Equation (1). Fortunately (and
somewhat unintuitively), such a mapping function can be
attained as shown in the following sequence of results.

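Lemma 3. If there exists a function $f(z,s)$ such that
$p(z|c,s) = f(z,s)$ for all $c \in \Sigma$, $z \in \mathcal{X}$
and $s \in \mathcal{S}$ then $p(z|s) = f(z,s)$.

Proof. $p(z|s) = \sum_{c \in \Sigma} p(z,c|s)
= \sum_{c \in \Sigma} p(z|c,s) \cdot p(c|s)
= \sum_{c \in \Sigma} f(z,s) \cdot p(c|s)
= f(z,s) \cdot \sum_{c \in \Sigma} p(c|s) = f(z,s)$. □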

Using Lemma 3, we prove the following theorem, which is
a sufficient condition for optimality of Equation (1).

^5 We consider “perfect privacy” to be that the adversary’s belief
about C given S doesn’t change after observing Z.
Theorem 1. If there exists a function $f(z,s)$ such that
$p(z|c,s) = f(z,s)$ for all $c \in \Sigma$, $z \in \mathcal{X}$
and $s \in \mathcal{S}$ then
$D_{KL}(p(c|z,s)\,\|\,p(c|s)) = 0$ for all z and
$s \in \mathcal{S}$.^6

Proof. Since $p(z|c,s) = f(z,s)$, using Lemma 3 we
know that $p(z|s) = f(z,s)$. Therefore, for any
$z \in \mathcal{X}$ and $s \in \mathcal{S}$ such that
$f(z,s) = p(z|c,s) = p(z|s) \neq 0$ we have

$$\frac{p(c|z,s)}{p(c|s)} = \frac{p(z|c,s) \cdot p(c|s)}{p(c|s) \cdot p(z|s)} = \frac{p(z|c,s)}{p(z|s)} = 1.$$

This implies $D_{KL}(p(c|z,s)\,\|\,p(c|s)) = 0$. □


Corollary 1. If a privacy mapping function R achieves
$p(z|c,s) = f(z,s)$ for some function $f(z,s)$, for all
$c \in \Sigma$, $z \in \mathcal{X}$ and $s \in \mathcal{S}$ then R
is the optimal solution to Equation (1).

Proof. The result follows from Theorem 1 and the fact that
$I(Z,C|S;R) = E_{p(z,s)}[D_{KL}(p(c|z,s;R)\,\|\,p(c|s;R))]$,
which is therefore zero and, by Lemma 1, the smallest attainable
value. □
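
The mechanism behind Theorem 1 can be seen on a small discrete example for one fixed subject s. The following minimal Python sketch applies Bayes' rule to two hypothetical likelihood tables; all probabilities and names are made up for illustration.

```python
def posterior(prior, likelihood, z):
    """Bayes' rule: p(c|z) is proportional to p(z|c) p(c), for one fixed subject s."""
    unnorm = {c: likelihood[c][z] * prior[c] for c in prior}
    evidence = sum(unnorm.values())
    return {c: v / evidence for c, v in unnorm.items()}


prior = {"class_a": 0.9, "class_b": 0.1}  # adversary's p(c|s), hypothetical
f = {"z0": 0.3, "z1": 0.7}                # p(z|c,s) = f(z,s), identical for every class
private = {"class_a": f, "class_b": f}
leaky = {
    "class_a": {"z0": 0.8, "z1": 0.2},    # class-dependent p(z|c,s)
    "class_b": {"z0": 0.1, "z1": 0.9},
}

post_private = posterior(prior, private, "z0")
post_leaky = posterior(prior, leaky, "z0")
print(post_private)
print(post_leaky)
```

When the likelihood does not depend on the class, the posterior coincides with the prior whatever the prior is, which is why the condition of the theorem does not require a model of the adversary's auxiliary knowledge. With the class-dependent table the posterior moves away from the prior.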

Note that Theorem 1 is independent of the model of $p(c|s)$
(and $p(s)$). This is a very important observation since it means
that in cases where a privacy mapping function R satisfies
the condition of the theorem, modeling the adversary’s prior
knowledge about information providers’ class memberships
is not needed. Furthermore, such a privacy mapping function
achieves perfect privacy against any adversary, regardless of
her auxiliary knowledge $p(c|s)$ (or $p(s)$). In the following
theorems we provide examples of using Theorem 1 that also serve
as cases where such privacy mapping functions are attainable.

Theorem 2. If $X | C = c, S = s \sim \mathcal{N}(\mu_c, \Sigma_c)$
(Normal distribution with mean $\mu_c$ and covariance matrix
$\Sigma_c$) for every $c \in \Sigma$ and $s \in \mathcal{S}$, then
$[R(c)](x) = \Sigma_c^{-1/2} \cdot (x - \mu_c)$ is an optimal
solution to Equation (1).

Proof. It is easy to verify that
$Z | C = c, S = s \sim \mathcal{N}(0, I)$ for every
$s \in \mathcal{S}$ and $c \in \Sigma$, where 0 is the origin in
the information space (vector of zeros) and I is the identity
matrix (of the appropriate dimensions). This means that
$p(z|c,s) = f(z,s)$ (not a function of c). By using Theorem 1 we
therefore know that R is the optimal solution to Equation (1). □
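
The effect of the mapping in Theorem 2 can be checked empirically. The following minimal Python sketch covers only the scalar special case, where the mapping reduces to (x - mu_c) / sigma_c; the class names, means and standard deviations are hypothetical values chosen for illustration.

```python
import random
import statistics


def privatize_normal(x, mu_c, sigma_c):
    """Scalar case of Theorem 2: [R(c)](x) = (x - mu_c) / sigma_c."""
    return (x - mu_c) / sigma_c


random.seed(0)
# Hypothetical per-class (mean, standard deviation) of the disclosed value.
class_params = {"class_a": (50.0, 5.0), "class_b": (80.0, 12.0)}

summary = {}
for c, (mu, sigma) in class_params.items():
    z = [privatize_normal(random.gauss(mu, sigma), mu, sigma) for _ in range(20000)]
    summary[c] = (statistics.mean(z), statistics.stdev(z))
    print(c, summary[c])
```

Both classes end up with sample mean close to 0 and sample standard deviation close to 1, so the privatized values no longer carry a class-dependent location or scale.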

The proofs of the following theorems are similar to that
of Theorem 2 and are thus omitted for conciseness.


Theorem 3. If $X | C = c, S = s \sim \mathrm{Exp}(\lambda_c)$
(Exponential distribution) for every $c \in \Sigma$ and
$s \in \mathcal{S}$, then $[R(c)](x) = \lambda_c x$ is an optimal
solution to Equation (1).

Theorem 4. If $X | C = c, S = s \sim \mathrm{Gamma}(k, \theta_c)$
(Gamma distribution with shape and scale parameters) for every
$c \in \Sigma$ and $s \in \mathcal{S}$, then
$[R(c)](x) = x / \theta_c$ is an optimal solution to Equation (1).

Theorem 5. If $X | C = c, S = s \sim U(a_c, b_c)$ (Continuous
Uniform distribution) for every $c \in \Sigma$ and
$s \in \mathcal{S}$, then $[R(c)](x) = (x - a_c) / (b_c - a_c)$
is an optimal solution to Equation (1).
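
The three mappings above can be checked in the same empirical way as the normal case. The following minimal Python sketch draws samples with the standard library generators (random.expovariate takes the rate, random.gammavariate takes shape and scale) and applies the per-class maps; all parameter values are hypothetical.

```python
import random
import statistics


def privatize_exponential(x, rate_c):
    """Theorem 3: [R(c)](x) = lambda_c * x."""
    return rate_c * x


def privatize_gamma(x, scale_c):
    """Theorem 4: [R(c)](x) = x / theta_c."""
    return x / scale_c


def privatize_uniform(x, a_c, b_c):
    """Theorem 5: [R(c)](x) = (x - a_c) / (b_c - a_c)."""
    return (x - a_c) / (b_c - a_c)


random.seed(0)
n = 20000
shape = 2.0  # shared Gamma shape k

exp_means = [
    statistics.mean(privatize_exponential(random.expovariate(r), r) for _ in range(n))
    for r in (0.5, 4.0)
]
gamma_means = [
    statistics.mean(privatize_gamma(random.gammavariate(shape, t), t) for _ in range(n))
    for t in (1.5, 10.0)
]
uniform_means = [
    statistics.mean(privatize_uniform(random.uniform(a, b), a, b) for _ in range(n))
    for a, b in ((0.0, 10.0), (-3.0, 5.0))
]
print(exp_means, gamma_means, uniform_means)
```

In each family the two classes are mapped to the same reference distribution (Exp(1), Gamma(k, 1) and U(0, 1) respectively), so the sample means agree across classes up to sampling noise.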

^6 (Cover and Thomas 2006, Definition 8.46): The Kullback-Leibler
divergence is defined as
$D_{KL}(p\,\|\,q) = E_p\left[\log \frac{p}{q}\right]$.


5 Implementation 

In this section, we briefly describe an implementation of the
learning problem that is publicly available in the form of a
MATLAB^7 toolbox (Aranki and Bajcsy 2015). In this
implementation, we investigate the question of learning a privacy
mapping function R from a labeled data set
$\mathcal{D} = \{(x_i, c_i)\}_i$.
This implies a simplifying assumption of ignoring the modeling
of the random variable S corresponding to the identity of
the information providers. This assumption has the following
implications on the model in Figure 1. First, it implies that
the adversary views information providers as uniformly
distributed, that is $p(s) = 1/|\mathcal{S}|$ for all
$s \in \mathcal{S}$. Second, the assumption implies that the
subject-class membership belief function of the adversary is equal
for all subjects, that is $p(c|s) = p(c)$ for all
$s \in \mathcal{S}$ and $c \in \Sigma$. As discussed in Section 4,
in the cases where perfect privacy is achievable, the solutions are
independent of these models and therefore these implications are
not limiting. Further study is necessary to assess the level of
privacy degradation incurred by this assumption in cases of
imperfect privacy. Third, this assumption implies that the
generative model of data per class is independent of the subjects,
that is $p(x|c,s) = p(x|c)$ for all $x \in \mathcal{X}$,
$c \in \Sigma$ and $s \in \mathcal{S}$.
Finally, $I(Z,C|S;R)$ simplifies to $I(Z,C;R)$.

In order to make the problem in Equation (1) computationally
tractable, a parametrized space for the privacy mapping
functions can be introduced, allowing for the optimization to
be performed on the parameter space. For example, consider
the following parameter space

$$\Theta(n,\Sigma) = \{ (A_c, b_c)_{c \in \Sigma} \mid \forall c \in \Sigma : A_c \in \mathbb{R}^{n \times n},\; b_c \in \mathbb{R}^{n},\; \det(A_c) \neq 0 \}$$

Then a parametrized space for affine privacy mapping functions
on the class set $\Sigma$ and information space $\mathcal{X}$ of
dimension n can be defined as

$$\mathcal{R}_{\Theta}(n,\Sigma,\mathcal{X}) = \{ R(\cdot;\theta) \mid \theta \in \Theta(n,\Sigma),\; R(\cdot;\theta) \in (\Sigma \to \mathcal{X}_{\hookrightarrow}),\; \forall c \in \Sigma : [R(c;\theta)](x) = A_c \cdot (x - b_c) \}$$

Provided a parameter search space $\Theta$, the optimization
problem in Equation (1) can be re-written as

$$\theta^* = \arg\min_{\theta \in \Theta} \; I(Z,C;R(\cdot;\theta)) \qquad (2)$$
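
To make the parametrized objective concrete, the following minimal Python sketch evaluates it for scalar affine maps on synthetic labeled data. It is not the toolbox implementation: the equal-width histogram estimator, the bin width, the synthetic Gaussian classes and all names are assumptions of this sketch.

```python
import math
import random
from collections import Counter


def histogram_mutual_information(zs, cs, bin_width=0.5):
    """Plug-in estimate of I(Z, C) in bits from equal-width histogram bins of z."""
    n = len(zs)
    bins = [math.floor(z / bin_width) for z in zs]
    joint = Counter(zip(bins, cs))
    n_bin = Counter(bins)
    n_class = Counter(cs)
    # p(b,c) * log2(p(b,c) / (p(b) p(c))) written with raw counts.
    return sum(
        (k / n) * math.log2(k * n / (n_bin[b] * n_class[c]))
        for (b, c), k in joint.items()
    )


def objective(theta, data, bin_width=0.5):
    """Objective for scalar affine maps [R(c; theta)](x) = a_c * (x - b_c)."""
    zs = [theta[c][0] * (x - theta[c][1]) for x, c in data]
    cs = [c for _, c in data]
    return histogram_mutual_information(zs, cs, bin_width)


random.seed(1)
# Synthetic labeled data set {(x_i, c_i)} with hypothetical Gaussian classes.
data = [(random.gauss(50.0, 5.0), "class_a") for _ in range(5000)]
data += [(random.gauss(80.0, 12.0), "class_b") for _ in range(5000)]

theta_identity = {"class_a": (1.0, 0.0), "class_b": (1.0, 0.0)}
theta_standardizing = {"class_a": (1 / 5.0, 50.0), "class_b": (1 / 12.0, 80.0)}

obj_identity = objective(theta_identity, data)
obj_standardizing = objective(theta_standardizing, data)
print(obj_identity, obj_standardizing)
```

On this synthetic data the identity parameters leave most of the class information in z, while the standardizing parameters drive the estimate close to zero. A search procedure over theta would use such an evaluation as its cost function.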

The straightforward way of modeling the required distributions
$p(c)$ and $p(x|c)$ from data is non-parametric, using
high-dimensional histograms. This approach, while simple
to implement, suffers from the curse of dimensionality as its
complexity grows exponentially with the dimension of the
information space. Once the models for $p(c)$ and $p(x|c)$ are
constructed, the model for $p(z|c)$ can be computed for any choice
of $\theta \in \Theta$, allowing the computation of the objective
function in Equation (2). Since the problem is non-convex, in
order to optimize the objective function, we employ a genetic
algorithm with the fitness function equal to the objective
function in Equation (2). The chosen selection policy is
fitness-proportional, while the chosen transformation
(evolution/genetic) operators are both mutations and crossovers
(Banzhaf et al. 1998).

^7 https://www.mathworks.com/products/matlab/


6 Experimentation


In this section we walk the reader through an example that
aims to motivate and demonstrate PDI. In this example we use
data that are published by the Centers for Disease Control and
Prevention (CDC) as part of the National Health and Nutrition
Examination Survey of 2012.^8 Specifically, we use the Body
Measures (BMX_G) portion of the data.^9

^8 https://wwwn.cdc.gov/nchs/nhanes/search/nhanes11_12.aspx

^9 https://wwwn.cdc.gov/nchs/nhanes/2011-2012/BMX_G.htm


6.1 Setting

In our setting, we consider the disclosed information to be
both Body Mass Index (BMI) and weight. Our information
providers are assumed to be individuals of both genders that
are 19 years of age or less. We consider the private information
to be the weight status category of the subject. The CDC
considers the following four standard weight status categories
for the aforementioned age group: i) underweight; ii) healthy
weight; iii) overweight; and iv) obese. There are 3355 data
points in the data set with subjects of 19 years of age or less.

According to the definitions of the CDC, the BMI category
of a child or a teen is classified based on the individual’s BMI
percentile among the same age and gender group, as described
in Table 1. Since the age of the information provider is not
part of the information space, the inference of the weight status
category of the information provider based on BMI and weight
is not perfect. The data for the different classes are depicted
in Figure 2.

Table 1: BMI-for-age weight status categories and the
corresponding BMI percentiles.

Weight Category     BMI Percentile Range
Underweight         BMI < 5%
Healthy Weight      5% ≤ BMI < 85%
Overweight          85% ≤ BMI < 95%
Obese               95% ≤ BMI


6.1.1 Inference Based on Original Data

Using the data, we trained 3 SVM classifiers with Gaussian
kernels. The classifiers are aggregate in terms of the “positive”
class in the following sense. The first classifier treats the
“positive” class as the underweight category (and so the
“negative” class is the rest of the categories). The second
classifier treats the “positive” class as either the underweight
or healthy weight categories. Finally, the third classifier treats
the “positive” class as any category except the obese category.
We used a 40-60 split for training-testing. In numbers, we used
1371 data points for training and 1984 data points for testing.

The training for all SVMs was done using 10-fold
cross-validation among the data in the training set to pick the
best σ of the Gaussian kernels and the best box boundaries of the
classifiers. The classification phase is done by taking a
majority vote from the 3 classifiers, and the output is the class
which most classifiers agree on. The results of the classifier
are described in Table 2 in terms of the confusion matrix of
the different categories. The total accuracy of the classifier is
88.31%.^10

6.2 Privatizing Information 


We would like to privatize the information at hand (BMI and
weight) in order to keep the weight status category as
private as possible (based on the training set only). This
scenario simulates a tele-monitoring scenario and fits the
assumptions and motivation introduced in Section 3. Therefore,
we aim to utilize PDI in order to privatize the data
as discussed earlier. In order to learn the privacy mapping
function from the training data, we use the MATLAB toolbox
mentioned in Section 5 (Aranki and Bajcsy 2015). We
used the affine privacy mapping functions for the parameterized
search space as shown in the example in Section 5.
Note that there are extra degrees of freedom in the problem,
since any privacy mapping functions $R_1$ and $R_2$ related by
$\forall c \in \Sigma : R_2(c) = A \cdot (R_1(c) - b)$ yield the
same objective value in Equation (1) for any
$A \in \mathbb{R}^{n \times n}$, $\det(A) \neq 0$
and $b \in \mathbb{R}^{n}$. That is, applying the same injective
affine transformation to all encoding functions in R does not
change the value of $I(Z,C|S;R)$. Therefore, in our problem we fix
the encoding function of the “underweight” class to the identity
function, i.e. $[R(\text{“underweight”})](x) = x$.
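
The invariance used to remove the extra degrees of freedom can be justified in two lines. The following derivation is a sketch that assumes only the data processing inequality for (conditional) mutual information; the symbols $g$, $Z_1$ and $Z_2$ are introduced here for illustration.

```latex
% Let g(z) = A (z - b) with det(A) \neq 0, so that g is injective,
% and let Z_1 = [R_1(C)](X), Z_2 = [R_2(C)](X) = g(Z_1).
\begin{align*}
I(Z_2, C \mid S) &= I(g(Z_1), C \mid S) \le I(Z_1, C \mid S), \\
I(Z_1, C \mid S) &= I(g^{-1}(Z_2), C \mid S) \le I(Z_2, C \mid S),
\end{align*}
% hence I(Z_2, C \mid S; R_2) = I(Z_1, C \mid S; R_1).
```

Since the common transformation cannot change the objective, fixing the encoding of one class loses no generality.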

The resultant privatized information is depicted in Figure 3.
It is clear that it should be much harder to do inference of the
weight category based on these privatized data, given the
decreased distinguishability between classes. Note that
calculating the privatized information is simple and efficient
since now we know the parameters for the privacy mapping functions
for the different classes.


^10 The adopted total accuracy measure is trace(M)/N where M is the
confusion matrix and N is the cardinality of the test set. This is
the percentage of true classifications over the test set.
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Table 2: Confusion matrix before privatizing. UW = Un¬ 
derweight, HW = Healthy Weight, OW = Overweight, OB = 
Obese 


Table 3: Confusion matrix after privatizing. UW = Underweight,
HW = Healthy Weight, OW = Overweight, OB = Obese
Figure 2: BMI and weight for the different weight status
groups.

6.2.1 Inference Based on Privatized Data 

To evaluate the quality of the privatization, we now train
3 new SVM classifiers with the same training procedure as
in Section 6.1.1, but this time using the privatized data
(and, of course, encoding the test set too for evaluation).
As before, we then use a majority vote from the 3
classifiers to predict the class of any data point. The
resultant confusion matrix is given in Table 3.
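
The voting step can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes that each of the three classifiers outputs one class label per data point, and the tie-breaking rule (fall back to the first classifier's label when all three disagree) is a hypothetical choice, since the text does not specify one.

```python
from collections import Counter


def majority_vote(labels):
    """Combine the labels predicted by three classifiers for one data point.

    Returns the label predicted by at least two classifiers. If all three
    disagree, returns the first classifier's label (assumed tie-break rule).
    """
    top_label, top_count = Counter(labels).most_common(1)[0]
    if top_count == 1:  # every classifier predicted a different class
        return labels[0]
    return top_label


# Hypothetical predictions from the three classifiers for one data point.
print(majority_vote(["HW", "OW", "HW"]))
```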

It is clear that the classification results are degraded
after privatizing the information. The total accuracy
dropped to 66.03% (from 88.31%). Given that the data from
different classes are now hard to distinguish, the
classifier classifies most data points as “healthy weight”.
This is to be expected, since most of the data points are
in the “healthy weight” category. Informally, if a
classifier had to make a “bet”, it would bet on the class
with the most data points. Formally, a lower bound on the
total accuracy can be achieved by considering the trivial
(deterministic) classifier that always predicts “healthy
weight”, which has total accuracy of 1270/1948 = 64.01%.
This shows that our result of 66.03% is not far above the
accuracy that is guaranteed as a lower bound.
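
The two quantities compared above can be computed directly from a confusion matrix. The sketch below is illustrative only: the matrix `M` is hypothetical and is not the paper's data, and it assumes the convention that rows index the true class and columns the predicted class.

```python
def total_accuracy(confusion):
    """trace(M) / N for a square confusion matrix M over a test set of size N."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


def majority_baseline_accuracy(confusion):
    """Accuracy of the trivial classifier that always predicts the largest class.

    Assumes rows index the true class, so row sums are the class sizes.
    """
    class_sizes = [sum(row) for row in confusion]
    return max(class_sizes) / sum(class_sizes)


# Hypothetical 4-class confusion matrix (UW, HW, OW, OB); NOT the paper's data.
M = [
    [2, 18, 0, 0],
    [1, 95, 4, 0],
    [0, 40, 8, 2],
    [0, 20, 5, 5],
]

print(f"total accuracy:    {total_accuracy(M):.2%}")
print(f"majority baseline: {majority_baseline_accuracy(M):.2%}")
```

The closer the first number is to the second, the less the classifier has learned beyond the class proportions.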

Note that the data set is biased in size against the
“underweight” category. There are only 126 data points with
weight category “underweight” out of the 3355 total data
points (3.76%). This makes privatizing that class
particularly hard, especially because the modeling is based
on n-dimensional histograms and is not parametric. For this
reason, the classification results before and after
privatization for the “underweight” category are comparable.

Figure 3: BMI and weight for the different weight status
groups after privatization. The difference between the two
plots is the order of plotting the different classes (for visual
clarity).

To intuitively demonstrate how privacy is preserved, we take
a piece of privatized information at random from our data
set, $z = [77.17, 296.45]^T$, without looking at its ground
truth weight category. If we decode this data point using
the decoding function of “healthy weight”, we get
$x = [21, 53.8]^T$, which is a legitimate “healthy weight”
BMI and weight data point. If we use the decoding function
of “overweight”, we get $x = [25.12, 62.4]^T$, which is also
a legitimate “overweight” BMI and weight data point.
Similarly, if we use the decoding function of “obese”, we
get $x = [30.42, 69.08]^T$, which is also a legitimate
“obese” BMI and weight data point.
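
The mechanics of this experiment can be sketched with affine encoding and decoding functions. All parameters in `PARAMS` are hypothetical and are not the values learned in the paper, so the decoded points below will not reproduce the numbers above. The sketch only shows that a single privatized point decodes to a different candidate under each class hypothesis, and that only the recipient who knows the true class recovers the original.

```python
def mat_vec(A, v):
    """Multiply a 2x2 matrix by a 2-vector."""
    return [A[0][0] * v[0] + A[0][1] * v[1],
            A[1][0] * v[0] + A[1][1] * v[1]]


def inverse_2x2(A):
    """Invert a 2x2 matrix; the encoding must be injective, so det != 0."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    if det == 0:
        raise ValueError("encoding matrix must be invertible")
    return [[A[1][1] / det, -A[0][1] / det],
            [-A[1][0] / det, A[0][0] / det]]


# Hypothetical affine parameters (A_c, b_c) per class; NOT the learned values.
PARAMS = {
    "underweight": ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),  # fixed to identity
    "healthy weight": ([[2.0, 0.5], [0.0, 3.0]], [1.0, -4.0]),
    "overweight": ([[1.5, 0.0], [0.2, 2.5]], [-3.0, 2.0]),
}


def encode(c, x):
    """Privatize x = [BMI, weight] for class c: z = A_c x + b_c."""
    A, b = PARAMS[c]
    Ax = mat_vec(A, x)
    return [Ax[0] + b[0], Ax[1] + b[1]]


def decode(c, z):
    """Recover x from z under the hypothesis that the class is c."""
    A, b = PARAMS[c]
    return mat_vec(inverse_2x2(A), [z[0] - b[0], z[1] - b[1]])


z = encode("healthy weight", [21.0, 53.8])
for c in PARAMS:
    print(c, decode(c, z))
```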

7 Discussion and Future Work 

In this paper, we presented a view on privacy in which the
data themselves need not be the private object, but rather
can be used to infer private information. From this point
of view, we derived a framework that prevents the private
information from being inferred from the communicated
messages. We provided theoretical analysis and properties
of the devised framework. An important result (Theorem 1)
provided conditions that ensure perfect privacy while
preserving full data utility. We showed that such conditions
are achievable by providing closed-form solutions for some
data generative models. The theorem further showed that
perfect privacy is not a function of the modeling of the
adversary’s auxiliary knowledge about the private
information per subject, p(c|s) (or p(s)). This observation
is important because modeling an adversary’s auxiliary
knowledge is generally a hard problem, and because it shows
that perfect privacy can be achieved regardless of that
knowledge. That is, the same privatization protects
information providers from all adversaries, regardless of
their auxiliary knowledge.

Subsequently, we discussed an implementation of the learning
problem resulting from the framework and demonstrated its
use with a data set published by the Centers for Disease
Control and Prevention, using data about individuals’ Body
Mass Indices, weights and weight status categories. The
experimentation shows that after privatizing the data set,
the classification accuracy drops significantly, to near a
lower bound of guaranteed classification accuracy, thus
achieving our set goal.

We make two important remarks about the approach presented
in this paper. First, the described approach is
philosophically different from classical cryptography, as
it provides a model where the objective is maintaining the
secrecy of private information that is not the data
themselves but the information that can be inferred from
the data. Second, even though the proposed approach is
privacy-centric, it is not meant to serve as an alternative
to cryptography but as a complement to it. That is, any
message can be “privatized” and then encrypted. If the
encryption is then compromised and an adversary gains access
to the clear text message, the privacy is still preserved.

The current implementation of the devised learning problem
suffers from the curse of dimensionality. The cost of
learning grows exponentially with the number of dimensions
of the information space. This is a result of our choice to
model p(z|c) as a multi-dimensional histogram. To make this
framework practical, there is a need to study other ways of
estimating the mutual information measure between the
disclosed information and the private class. One appealing
option is leveraging parametric learning and modeling each
distribution p(z|c) as a mixture model, which could result
in more computationally efficient estimation of the mutual
information measure.
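
To make the cost concrete, the following is a minimal sketch of a histogram-based (plug-in) estimate of the mutual information between disclosed points and private classes. It is not the toolbox's implementation. The function name, the fixed `bin_width`, and the simplification to I(Z; C) (ignoring the subject variable S, in line with the independence assumption discussed in this section) are all assumptions made for illustration.

```python
import math
from collections import Counter


def histogram_mutual_information(points, labels, bin_width):
    """Plug-in estimate of I(Z; C) in bits from an n-dimensional histogram over Z.

    points: list of equal-length tuples (disclosed information z)
    labels: list of class labels c, one per point
    """
    n = len(points)
    # Assign each point to a histogram cell (one integer index per dimension).
    bins = [tuple(math.floor(v / bin_width) for v in p) for p in points]
    joint = Counter(zip(bins, labels))
    bin_counts = Counter(bins)
    label_counts = Counter(labels)
    mi = 0.0
    for (b, c), count in joint.items():
        p_joint = count / n
        p_independent = (bin_counts[b] / n) * (label_counts[c] / n)
        mi += p_joint * math.log2(p_joint / p_independent)
    return mi


# Hypothetical toy data: two classes that occupy different histogram cells.
separated = [(0.5, 0.5), (0.6, 0.4), (5.5, 5.5), (5.2, 5.9)]
print(histogram_mutual_information(separated, ["a", "a", "b", "b"], 1.0))
```

With k bins per dimension, the histogram has k^n cells in n dimensions, so the number of samples needed to populate it grows exponentially with n. This is the cost that a parametric model of p(z|c) would aim to avoid.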

The presented framework has the potential of being extended
to scenarios where the data recipient is not completely
certain about the private class but is still more certain
than the adversary. Such scenarios are clearly more general
and may result in wider applicability of the framework to
scenarios other than those presented here. Indeed, in such
scenarios, communicated messages can only be interpreted in
a statistical sense, and the implications of such
assumptions must be studied as well.

Furthermore, the current implementation of the learning
problem assumes that adversaries have equal belief about all
the information providers, so that the adversary’s belief
about C is independent of S, and that the generative model
of data X per private class C is independent of S. This is a
simplifying assumption, and its implications need to be
further studied and remedied.


Given the non-convexity and the complexity of the problem at
hand, areas for future research include studying heuristic
techniques to learn the privacy mapping functions from
sufficient and/or necessary conditions for local
improvements in the mutual information as a function of
local changes in the privacy mapping functions. This
approach, as opposed to finding globally optimal solutions
to the learning problem, is analogous to finding minimal
anonymization as opposed to optimal anonymization in
privacy-preserving data publishing (Fung et al., 2010).